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Abstract 

Background: Fluorescent and luminescent gene reporters allow us to dynamically quantify changes in molecular 
species concentration over time on the single cell level. The mathematical modeling of their interaction through 
multivariate dynamical models requires the development of effective statistical methods to calibrate such models 
against available data. Given the prevalence of stochasticity and noise in biochemical systems inference for 
stochastic models is of special interest. In this paper we present a simple and computationally efficient algorithm 
for the estimation of biochemical kinetic parameters from gene reporter data. 

Results: We use the linear noise approximation to model biochemical reactions through a stochastic dynamic model 
which essentially approximates a diffusion model by an ordinary differential equation model with an appropriately 
defined noise process. An explicit formula for the likelihood function can be derived allowing for computationally 
efficient parameter estimation. The proposed algorithm is embedded in a Bayesian framework and inference is 
performed using Markov chain Monte Carlo. 

Conclusions: The major advantage of the method is that in contrast to the more established diffusion approxima- 
tion based methods the computationally costly methods of data augmentation are not necessary. Our approach 
also allows for unobserved variables and measurement error. The application of the method to both simulated and 
experimental data shows that the proposed methodology provides a useful alternative to diffusion approximation 
based methods. 

Supplementary Information (SI): Available at |http://www.warwick.ac.uk/s taff/M.Komorowski/LNASI.pdf 



1 Background 

The estimation of parameters in biokinetic models from experimental data is an important problem in 
Systems Biology. In general the aim is to calibrate the model so as to reproduce experimental results in 
the best possible way. The solution of this task plays a key role in interpreting experimental data in the 
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context of dynamic mathematical models and hence in understanding the dynamics and control of complex 
intracellular chemical networks and the construction of synthetic regulatory circuits [1]. Among biochemical 
kinetic systems, the dynamics of gene expression and of gene regulatory networks are of particular interest. 
Recent developments of fluorescent microscopy allow us to quantify changes in protein concentration over 
time in single cells (e.g. [2.3]) even with single molecule precision (see [4] for review). Therefore an abundance 
of data is becoming available to estimate parameters of mathematical models in many important cellular 
systems. 

Single cell imaging techniques have revealed the stochastic nature of biochemical reactions (see [5, 6] for 
review) that most often occur far from thermodynamic equilibrium [7] and may involve small copy numbers of 
reacting macromolecules [8] . This inherent stochasticity implies that the dynamic behaviour of one cell is not 
exactly reproducible and that there exists stochastic heterogeneity between cells. The disparate biological 
systems, experimental designs and data types impose conditions on the statistical methods that should be 
used for inference [9-11]. From the modeling point of view the current common consensus is that the most 
exact stochastic description of the biochemical kinetic system is provided by the chemical master equation 
(CME) [12]. Unfortunately, for many tasks such as inference the CME is not a convenient mathematical 
tool and hence various types of approximations have been developed. The three most commonly used 
approximations are [13]: 

1 . The macroscopic rate equation (MRE) approach which describes the thermodynamic limit of the system 
with ordinary differential equations and does not take into account random fluctuations due to the 
stochasticity of the reactions. 

2. The diffusion approximation (DA) which provides stochastic differential equation (SDE) models where 
the stochastic perturbation is introduced by a state dependant Gaussian noise. 

3. The linear noise approximation (LNA) which can be seen as a combination because it incorporates the 
deterministic MRE as a model of the macroscopic system and the SDEs to approximatively describe 
the fluctuations around the deterministic state. 

Statistical methods based on the MRE have been most widely studied [9, 14-16]. They require data based 
on large populations. The main advantages of this method are its conceptual simplicity and the existence 
of extensive theory for differential equations. However, single cells experiments and studies of noise in 
small regulatory networks created the need for statistical tools that are capable to extract information 
from fluctuations in molecular species. Two methods have been proposed to address this. The one by 
[17] assumes availability of single molecule precision data. Another approach is based on the diffusion 
approximation [10, 18]. This uses likelihood approximation methods (e.g. [19]) that are computationally 
intensive and require sampling from high dimensional posterior distributions. Inference using these methods 
is particularly difficult for low frequency data with unobserved model variables [11, 18]. The aim of this 
study is to investigate the use of the LNA as a method for inference about kinetic parameters of stochastic 
biochemical systems. We find that the LNA approximation provides an explicit Gaussian likelihood for 
models with hidden variables and measurement error and is therefore simpler to use and computationally 
efficient. To account for prior information on parameters our methodology is embedded in the Bayesian 
paradigm. The paper is structured as follows: We first provide a description of the LNA based modeling 
approach and then formulate the relevant statistical framework. We then study its applicability in four 
examples, based on both simulated and experimental data, that clarify principles of the method. 

2 Methods 

The chemical master equation (CME) is the primary tool to model the stochastic behaviour of a reacting 
chemical system. It describes the evolution of the joint probability distribution of the number of different 
molecular species in a spatially homogeneous, well stirred and thermally equilibrated chemical system [12]. 
Even though these assumptions are not necessarily satisfied in living organisms the CME is commonly 
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regarded as the most realistic model of biochemical reactions inside living cells. Consider a general system 
of N chemical species inside a volume and let X = (Xi, . . . , X^) T denote the number and x = X/fi the 
concentrations of molecules. The stoichiometry matrix S = {Sij}i=i 2...JV; 3=1,2. ..R describes changes in the 
population sizes due to R different chemical events, where each Sij describes the change in the number of 
molecules of type i from Xi to Xi + Sij caused by an event of type j. The probability that an event of type 
j occurs in the time interval [t, t + dt) equals fj(?i,£l,t)£ldt. The functions /j(x, f2,f) are called mesoscopic 
transition rates. This specification leads to a Poisson birth and death process where the probability /i(X, i) 
that the system is in the state X at time t is described by the CME [13] which is given in the SI. 

The first order terms of a Taylor expansion of the CME in powers of l/\/f2 are given by the following 
MRE (see SI) 

R 

^=$>«/ifo*) i = l,2,...,JV; (1) 
j=i 

where fa = limn^oo.x—oo Xi/Q,, ip = (fa, . . . , <j) N ) T and fj(<p, t) = limn-^ fj(x, f2, t). 
Including also the second order terms of this expansion produces the LNA 

x(t)=p(t) + fi-*£(t) (2) 

which decomposes the state of the system into a deterministic part (p as solution of the MRE in ([!]) and a 
stochastic process £ described by an Ito diffusion equation 

df (i) = A(t)£df + E(t)dW, (3) 

where cW denotes increments of a Wiener process, [A(t)] jfe = X^jLi Sijdfj l^fa, [E(i)] ifc = Siky/ fk(<P,t) 
and /j = /i(<^) (see SI for derivation). 

The rationale behind the expansion in terms of 1/ V" is that for constant average concentrations relative 
fluctuations will decrease with the inverse of the square root of volume [20] . Therefore the LNA is accurate 
when fluctuations are sufficiently small in relation to the mean (large £1). Hence, the natural measure of 
adequacy of the LNA is the coefficient of variation i.e. ratio of the standard deviation to the mean (see SI). 
Validity of this approximation is also discussed in details in [20,21]. In addition it can be shown that the 
process describing the deviation from the deterministic state fis (x — ip) converges weakly to the diffusion (jij 
as — > 00 [22] . In order to use the LNA in a likelihood based inference method we need to derive transition 
densities of the process x. 



2.1 Transition densities 

The LNA provides solutions that are numerically or analytically tractable because the MRE in (JT|) can be 
solved numerically and the linear SDE in Q for an initial condition = has a solution of the form [23] 



£(t) = $ t< (t-*0 (tu + J ^ ti (s-t i )- 1 B(s)dW(s) 



(4) 



where the integral is in the Ito sense and <I> ti (s) is the fundamental matrix of the non-autonomous system 
of ODEs 

a -^=A(t i + s)^ ti , $ ti (0)=7. (5) 
as 

Equations Q , ^ imply that the transition densities [24] of the process £ are Gaussiarj^J [24] 

p(6 4 |e ti _ 1 ,e)=^(^|Mi-i,S i _ 1 ) (6) 



throughout the paper we use 'Gaussian' or 'normal' shortly to denote either a univariate or a multivariate normal distri- 
bution depending on the context. 
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where denotes a vector of all model parameters, ^(-l/Xj-i, 5j_i) is the normal density with mean and 
variance 3j_i specified by 

= ^^(Aj-i)^, A i _ 1 = i i -ii_i, (7) 
- s)E(s))(^ s (ti - s)E(s)) T ds 



It follows from ([2| and ([6]) that the transition densities of x are normal 

P (x ti \* ti _ 1 , e) = v(x tl \<p(u) + n -1 3<-i). (8) 

The properties of the normal distribution allow us to derive an explicit formula for the likelihood of observed 
data. 



2.2 Inference 

It is rarely possible to observe the time evolution of all molecular components participating in the system of 
interest [25]. Therefore, we partition the process x t into those components y< that are observed and those 
z t that are unobserved. 

Let x = (x to , . . . ,Xt n ), y = (yt , . . . , yt„) and z = (z to , . . . ,z tn ) denote the time-series that comprise 
the valucfj^Jof processes x, y and z, respectively, at times to, ■ ■ ■ tn- 

Our aim is to estimate the vector of unknown parameters O from a sequence of measurements y. Given 
the Markov property of the process x the augmented likelihood P(y,z|0) is given by 

n 

P(y,z|0) = []p(x 4i |x t( _ 1) e)p(x to |e) ) (9) 

i=i 

where p(xt ( |X{ i _ 1 , 0) are Gaussian densities specified in (|8j). For mathematical convenience we assume that 
p(x( 10) is also normal with mean tp(to) and covariancc matrix S_i. This assumption is justified as equations 
([2]) and ([3]) imply normal distribution at any time given a fixed initial condition. Mean (p(to) and covariance 
matrix S_i are parameterized as elements of 0. 

It can then be shown that (see SI) x is Gaussian. Therefore 

F(y,z|0)=V(x|^o),...,^„),E), (10) 

where ip(-\ip(to), . . . , tp(t n ), 2) is Gaussian with mean vector (yj(trj), . . . , <p(t n )) and covariance matrix S whose 
elements can be calculated numerically in a straightforward way (see SI) . Since the marginal distributions are 
also Gaussian it follows that the likelihood function P(y|0) can be obtained from the augmented likelihood 
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^(y|0) = V(y|K(t ), • ■ ■ , ¥>„(*«)),£), (11) 

where the covariance matrix S = {S^ J - ) }i J= o,...,n is a sub-matrix of S such that U'*'^ = Cov(yt i ,yt j ) and 
ip v is the vector consisting of the observed components of tp. 

Fluorescent reporter data are usually assumed to be proportional to the number of fluorescent molecules 
[26] and measurements are subject to measurement error, i.e. errors that do not influence the stochastic 
dynamics of the system. We therefore assume that instead of the matrix y our data have the form u = 
Ay + (et , ...,££„). The parameter A is a proportionality constant]^] and e ti denotes a random vector for 
additive measurement error. For mathematical convenience we assume that the joint distribution of the 
measurement error is normal with mean and known covariance matrix S e , i.e. (e to , . . . , e tii ) ~ N(0, E c ). If 



2 Here and throughout the paper we use the same letter to denote the stochastic process and its realization. 

,5 It is straightforward to generalize for the case with different proportionality constants for different molecular components. 
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measurement errors are independent with a constant variance of then S e = a~I. Equation (11 1 implies that 
the likelihood function can be written as 



P(u|0) = V(u|A(<^(* ), • • • , <p v (t n )), A 2 S + E e ). 



(12) 



Since for given data u the likelihood function ( 12 1 can be numerically evaluated any likelihood based inference 
is straightforward to implement. Using Bayes' theorem, the posterior distribution P(0|u) satisfies the 
relation [27] 

F(6|u) oc P(u|0)7r(0). (13) 

We use the standard Metropolis-Hastings (MH) algorithm [27] to sample from the posterior distribution in 
(pi 



3 Results 

In order to study the use of the LNA method for inference we have selected four examples which are 
related to commonly used quantitative experimental techniques such as measurements based on reporter 
gene constructs and reporter assays based on Polymerase Chain Reaction (e.g. RT-PCR, Q-PCR). For 
expository reasons, all case studies consider a model of single gene expression. 



3.1 Model of single gene expression 

Although gene expression involves various biochemical reactions it is essentially modeled in terms of only 
three biochemical species (DNA, mRNA, protein) and four reaction channels (transcription, mRNA degrada- 
tion, translation, protein degradation) [28-30]. Let x = (r,p) denote concentrations of mRNA and protein, 
respectively. The stoichiometry matrix has the form 

S ={ 1 -ij. ( 14 ) 

where rows correspond to molecular species and columns to reaction channels. For the reaction rates 

~t(x) = (k R (t) nR r,k P r nP p) T (15) 

we can derive the following macroscopic rate equations 

= k R (t) - j R (j)R, 4>p = k P (f>R - 7p0p. (16) 

For the general case it is assumed that the transcription rate k R (t) is time-dependent, reflecting changes in 
the regulatory environment of the gene such as the availability of transcription factors or chromatin structure. 
Using (15) and (16 1 in ^ we obtain the following SDEs describing the deviation from the macroscopic 



state 



d^R = -lR.ZR.dt + y/k R {t) + lR (t> R {t)dW R , (17) 
d^ P = (kp^ R - lp£p)dt + yJk P (j) R {t) + jp(j}p(t)dW P . 



We will refer to the model in ( 16 1 and ( 17 1 as the simple model of single gene expression. 

In order to test our method on a nonlinear system we will also consider the case of an autregulated 
network where the transcription rate of the gene is a function of a modified form of the protein that the 
gene codes for and where the modified protein is a transcription factor that inhibits the production of its 
own mRNA. This is parameterized by a Hill function [28] k R (t,p) = k R (t)/(l + (p/H) nH ) where k R (t) now 
describes the maximum rate of transcription, H is a dissociation constant and nn is a Hill coefficient. Thus, 
the nonlinear autoregulatory model the system is described by the MRE 

(f>R. = k R {t,<j)p) - j R phi R , §p = k P (f)R - Jp<frp (18) 
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and the SDEs 



<%p = (fcpCfl - 7P^p)d* + Vkp(f> R (t) + lP (f>p(t)dWp (19) 

where fc^(t) = dkp(t, 4>p)/d(j)p. We refer to this model as the autoregulatory model of single gene expression. 
The two models constitute the basis of our inference studies below. 



3.2 Inference from fluorescent reporter gene data for the simple model of single gene expression 

To test the algorithm we first use the simple model of single gene expression. We generate data according 



to the stoichiometry matrix (14 1 and rates (151 using Gillespie's algorithm [31] and sample it at discrete 
time points. We then generate artificial data that are proportional to the simulated protein data with added 
normally distributed measurement error with known variance a\. Furthermore we assume that mRNA levels 
are unobserved. Thus the data are of the forrrjf] 

u=(u t0 ,...,u tn ) T , (20) 

where u t - = Xpti + £*; , Pt t is the simulated protein concentration, A is an unknown proportionality constant 
and e ti is measurement error. For the purpose of our example we model the transcription function by 

f &r J exp(-6 1 (t-& 3 ) 2 )+&4 t<b 3 
k R (t) = < (21) 
{ b Q exp(~b 2 (t~b 3 ) 2 )+b 4 t>b 3 

This form of transcription corresponds to an experiment, where transcription increases for t < b 3 as a result 
of being induced by an environmental stimulus and for t > 63 decreases towards a baseline level 64. 
We assume that at time to (to << b 3 ) the system is in a stationary state. Therefore, the initial condition of 
the MRE is a function of unknown parameters (4>Ft(to),<fip(to)) = (64/7/?, b^kp /^jpsip). 

To ensure identifiability of all model parameters we assume that informative prior distributions for both 
degradation rates are available. Priors for all other parameters were specified to be non-informative. 

To infer the vector of unknown parameters 

© = {jB.,lP,kp,\,bo,b 1 ,b2,b3,b 4 ) 
we sample from the posterior distribution 

P(6|u) cx P(u|6)7r(6) 



using the standard MH algorithm. The distribution P(u|0) is given by (12 1. 

The protein level of the simulated trajectory is sampled every 15 minutes and a sample size of 101 points 
obtained. We perform inference for two simulated data sets: estimate 1 is based on a single trajectory while 
estimate 2 represents a larger data set using 20 sampled trajectories (see Figure 1A). All prior specifications, 
parameters used for the simulations and inference results are presented in Table 1A. 

Estimate 1 demonstrates that it is possible to infer all parameters from a single, short length time series 
with a realistically achievable time resolution. Estimate 2 shows that usage of the LNA does not seem 
to result in any significant bias. A bias has not been detected despite the very small number of mRNA 
molecules (5 to 35 - Figure 2A in the SI) and protein molecules (100 to 500 - Figure 1A). The coefficient of 
variation varied between approximately 0.15 and 0.4 for both molecular species (Figure 1 in the SI). 

Inference for this model required sampling from the 9 dimensional posterior distribution (number of 
unknown parameters) . If instead one used a diffusion approximation based method it would be necessary 
to sample from a posterior distribution of much higher dimension (see SI). In addition, incorporation of the 
measurement error is straightforward here, whereas for other methods it involves a substantial computational 
cost [18]. 

4 The volume of the system f2 is unknown and we set Q = 1 so that concentration equals the number of molecules. 
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(A) 



Par am. 


Prior 


Value 


Estimate 1 


Estimate 2 


IR 


FCO 44 10~ 2s l 


0.44 


43 (0 27-0 60 s ! 


49 CO 38-0 61 "l 

IVJ.OO U.U1 1 


IP 


r(o.52,io- 2 ) 


0.52 


0.51 (0.35-0.67) 


0.49 (0.38-0.61) 


kp 


Exp(lOO) 


10.00 


21.09 (3.91-67.17) 


11.41 (7.64-16.00) 


A 


Exp(lOO) 


1.00 


1.42 (0.60-2.57) 


1.08 (0.76-1.36) 


b 


Exp(lOO) 


15.00 


6.80 (0.97-18.43) 


12.78 (9.80-15.33) 


h 


Exp(l) 


0.40 


0.79 (0.05-3.02) 


0.29 (0.18-0.43) 


b 2 


Exp(l) 


0.40 


0.77 (0.08-2.79) 


0.77 (0.32-1.59) 


b 3 


Exp(lO) 


7.00 


6.13 (4.41-7.85) 


7.25 (6.79-7.55) 


W 


Exp(lOO) 


3.00 


0.94 (0.11-2.88) 


2.87 (2.11-3.52) 


(B) 










Par am. 


Prior 


Value 


Estimate 1 


Estimate 2 


IR 


r(0.44,10" 2 ) 


0.44 


0.44 (0.27-0.60) 


0.42 (0.32-0.54) 


7p 


r(o.52,io- 2 ) 


0.52 


0.49 (0.33-0.65) 


0.49 (0.36-0.61) 


kp 


Exp(lOO) 


10.00 


14.86 (3.18-47.97) 


9.35 (5.87-13.15) 


A 


Exp(lOO) 


1.00 


1.26 (0.48-2.30) 


1.15 (0.81-1.50) 


b 


Exp(lOO) 


15.00 


5.99 (0.20-23.06) 


13.47 (9.24-17.13) 


h 


Exp(l) 


0.40 


0.59 (0.01-2.75) 


0.27 (0.14-0.53) 




Exp(l) 


0.40 


0.92 (0.05-2.92) 


0.83 (0.21-3.52) 


b 3 


Exp(lO) 


7.00 


6.53(0.74-14.69) 


7.27 (6.44-7.79) 


W 


Exp(lOO) 


3.00 


2.85 (0.35-7.19) 


2.64 (1.82-3.32) 



Table 1: Inference results for (A) the simple model and (B) autoregulatory model of single gene expression 
Parameter values used in simulation, prior distribution, posterior medians and 95% credibility intervals. 
Estimate 1 corresponds to inference from single time series, Estimate 2 relates to estimates obtained from 20 
independent time series. Data used for inference are plotted in Figure [l|V for case A and Figure [l]3 for case 
B. Variance of the measurement error was assumed to be known a e = 9. Rates are per hour. Parameters 
are tih = 1, H = 61.98 in case B. 
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Figure 1: Protein timeseries generated using Gillespie's algorithm for the simple A and autoregulatory B 
models of single gene expression with added measurement error (a\ = 9). Initial conditions for mRNA and 
protein were sampled from Poisson distributions with means equal to the stationary means of the system 
with equal constant transcription rate 64. In the autoregulatory case we set H — b^kp /2^p^/p. In each panel 
20 time series are presented. The deterministic and average trajectories are plotted in bold solid and dashed 
lines respectively. Corresponding mRNA trajectories (not used for inference) are presented in the SI. 

3.3 Inference from fluorescent reporter gene data for the model of single gene expression with au- 
toregulation 

The following example considers the autoregulatory system with only a small number of reacting molecules. 
Using Gillespie's algorithm we generate artificial data from the single gene expression model with autoregu- 
lation. The protein time courses were then sampled every 15 minutes at 101 discrete points per trajectory 
(see Figure IB). As before we assume that the mRNA time courses are not observed and that the protein 



data are of the form given in (20), i.e. proportional to the actual amount of protein with additive Gaussian 
measurement error. As in the previous case study we estimate parameters from two simulated data sets, 
a single trajectory and an ensemble of 20 independent trajectories. The inference results summarized in 
Table IB show that despite the low number of mRNA (0-15 molecules, see Fig. 2 in SI) and protein (10-250 
molecules, see Fig. |TJ3) all parameters can be estimated well with appropriate precision. 

3.4 Inference for PCR based reporter data 

In the case of reporter assays based on Polymerase Chain Reaction (e.g. RT-PCR, Q-PCR) measurements 
are obtained from the extraction of the molecular contents from the inside of cells. Since the sample is 
sacrificed, the sequence of measurements are not strictly associated with a stochastic process describing the 
same evolving unit. Assume that at each time point t{ (i = 0, ..n) we observe I measurements that are 
proportional to the number of RNA molecules either from a single cell or from a population of s cells. This 
gives a (n+ 1) X 1 matrix of data points 

" , / ( 22 ) 

where u tu j — Xr ti .j + €ti,ji r ti.j is the actual RNA level, A is the proportionality constant, e ti j is a Gaussian 
independent measurement error indexed by time ti\ j = 1, . . . , I indexes the / measurements that are taken 
at time fcj. Note that r t .j and r ti+1 j are independent random variables as they refer to different cells. We 
assume that the dynamics of RNA is described by the simple model of single gene expression with LNA 



equations ( 16 > and (17). Let T t denote the distribution of measured RNA at time t (it t ~ T t ). In order to 
accommodate for the different form of data we modify the estimation procedure as follows. For analytical 
convenience we assumed that the initial distribution is normal T f[) = N(fj, to , of ). This together with eq. d8|) 



Figure 2: Left:PCR based reporter assay data simulated with Gillespie's algorithm using parameters 
presented in Tableland extracted 51 times (n=50), every 30 minutes with an independently and normally 
distributed error (erf = 9). Each cross correspond to the end of simulated trajectory, so that the data drawn 
are of form ( |22[ ) . Since number of RNA molecules is know upto proportionality constant y-axis is in arbitrary 
units. Time on x-axis is expressed in hours. Estimates inferred form this data are shown in column Estimate 
1 in Table [2] Right: Fluorescence level from cycloheximide experiment is plotted against time (in hours). 
Subsequent dots correspond to measurements taken every 6 minutes. 

and normality of measurement error implies that T t = N({i t , of). Simple explicit formulae for fi t and of are 
derived in the SI. Since all observations u tu . are independent we can write the posterior distribution as 

n I 

«-(e|u)« JJIJ^ut^-i^.c^) Tr(e), (23) 

i=0 j=l 

where ^(•|/i ti , erf.) is the normal density with parameters fi ti , of. ■ hi order to infer the vector of the unknown 
parameters — (~fp, A, 60, bi, 62, &3; bi, fit , of o ) we sample from the posterior using a standard MH algorithm. 
To test the algorithm we have simulated a small (I — 10, n = 50, plotted in Figure 2) and a large (I — 100, 
n = 50) data set using Gillespie's algorithm with parameter values given in Table 2. The data were sampled 
discretely every 30 minutes and a standard normal error was added. Initial conditions were sampled from the 
Poisson distribution with mean b^/jp. The estimation results in Table 2 show that parameters can be inferred 
well in both cases even though the number of RNA molecules in the generated data is very small (about 5-35 
molecules). Since subsequent measurements do not belong to the same stochastic trajectory, estimation for 
the model presented here is not straightforward with the diffusion approximation based methods. 

3.5 Estimation of gfp protein degradation rate from cycloheximide experiment 

In this section the method is applied to experimental data. After a period of transcriptional induction, 
translation of gfp was blocked by the addition of cycloheximide (CHX). Details of the experiment are 
presented in the SI. Fluorescence was imaged every 6 minutes for 12. 5h (see Figure 2). Since inhibition may 
not be fully efficient we assume that translation may be occurring at a (possibly small) positive rate kp. 
The model with the LNA is 

<j) P = k P - jpfip, (24) 

d£ P = —-fp£,pdt + y/kp + -fp(j>pdWp. 

The observed fluorescence is assumed to be proportional to the signal with proportionality constant A. For 
comparison we also consider the diffusion approximation for which an exact transition density can be derived 
analytically (see SI for derivation) 

dp = (kp - jpp)dt + y/k P + jppdWp. (25) 
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Parameter 


Prior 


Value 


Estimate 1 


Estimate 2 


1r 


Exp(l) 


0.44 


0.45 (0.35-0.60) 


0.46 (0.42-0.50) 


A 


Exp(lOO) 


1.00 


1.07 (0.90-1.22) 


1.01 (0.95-1.05) 


bo 


Exp(lOO) 


15.00 


13.13 (10.20-15.87) 


14.91 (13.86-15.77) 


bi 


Exp(l) 


0.40 


0.29 (0.14-0.51) 


0.43 (0.32-0.54) 


b 2 


Exp(l) 


0.40 


0.32 (0.12-0.93) 


0.32 (0.21-0.43) 


b 3 


Exp(lO) 


7.00 


7.05 (6.39-7.63) 


6.99 (6.76-7.15) 


bi 


Exp(lOO) 


3.00 


2.97 (2.00-4.18) 


3.10 (2.76-3.43) 


Ho 


Exp(lOO) 


6.76 


6.90 (5.79-7.69) 


6.55 (6.14-6.85) 


-I 


Exp(lOO) 


6.76 


3.52 (0.01-8.99) 


7.59 (5.44-9.49) 



Table 2: Inference results for PCR based reporter assay simulated data Parameter values used to generate 
data, prior distributions used for estimation, posterior median estimates together with 95% credibility in- 
tervals. Estimate 1, Estimate 2 columns relate to small (1=5, n=50) and large (1=100, n=50) sample sizes. 
Variance of the measurement was assumed to be known of = 4. Estimated rates are per hour. 



Par am. 


Prior 


Estimate LNA 


Estimate DA 


IP 


Exp(l) 


0.45 (0.31-0.62) 


0.53 (0.39-0.67) 


kp 


Exp(50) 


0.32(0.10-1.75) 


0.43 (0.16-1.07) 


A 


Exp(50) 


22.79(13.79-36.92) 


23.85(16.31-36.54) 


MO) 


N(ut ,ut ) 


889.03(831.44-945.34) 





Table 3: Inference results for CHX experimental data. Priors, posterior mean and 95% credibility intervals 
obtained from CHX experimental data using the LNA approach and diffusion approximation approach. 
Estimation with the LNA involved one more parameter </>p(0) = \<j>p(Q). Estimated rates are per hour. 

Since incorporation of measurement error for the diffusion approximation based model is not straightforward, 
we assume that measurements were taken without any error to ensure fair comparison between the two 
approaches. Table 3 shows that estimates obtained with both methods are very similar. 

4 Discussion 

The aim of this paper is to suggest the LNA as a useful and novel approach to the inference of biochemical 
kinetics parameters. Its major advantage is that an explicit formula for the likelihood can be derived even 
for systems with unobserved variables and data with additional measurement error. In contrast to the 
more established diffusion approximation based methods [10, 18] the computationally costly methods of data 
augmentation to approximate transition densities and to integrate out unobserved model variables are not 
necessary. Furthermore, this method can also accommodate measurement error in a straightforward way. 
The suggested procedure here is implemented in a Bayesian framework using MCMC simulation to generate 
posterior distributions. The LNA has previously been studied in the context of approximating Poisson birth 
and death processes [20-22, 32] and it was shown that for a large class of models the LNA provides an 
excellent approximation. Furthermore, in [32] it is shown that for the systems with linear reaction rates 
the first two moments of the transition densities resulting from the CME and the LNA are equal. Here we 
propose using the LNA directly for inference and provide evidence that the resulting method can give very 
good results even if the number of reacting molecules is very small. Our experience from previous works 
with diffusion approximation based methods [11, 18] is that their implementation is challenging especially for 
data that are sparsely sampled in time because the need for imputation of unobserved time points leads to 
a very high dimensionality of the posterior distribution. This usually results in highly autocorrelated traces 
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affecting the speed of convergence of the Markov chain. Our method considerably reduces the dimension of 
the posterior distribution to the number of unknown parameters of a model only and is independent of the 
number of unobserved components. Nevertheless it can only be applied to the systems with sufficiently large 
volume, where fluctuations around a deterministic state are relatively close to the mean. 

5 Authors contributions 

MK proposed and implemented the algorithm. CVH performed the cycloheximidc experiment. MK wrote 
the paper with assistance from BF and DAR, who supervised the study. 
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